Bad weather Kelheim Demo

Oleksandr Soboliev

Data

Data is taken from these resources:

Regression analysis resources

The analysis was carried out using these statistical sources:

Importing and preparing data

The main goal is to build a result table containing all “expected” relevant variables that could be connected with changes in demand in the Kelheim region. Since the mobility data is collected on a daily basis, the remaining data also has to be aggregated per day.

After obtaining such a table, it is useful (both for building a model and for understanding the data) to check the correlation of each individual independent variable with the dependent mobility variable. The next chapter therefore focuses on filtering out irrelevant data before passing it to the model.

Modify and join data

# Weatherstack
weatherstack_kelheim_daily = weatherstack_kelheim %>%
  group_by(date) %>%
  count(description)

# Stringency 
deu_stringency = json[grep("DEU.stringency_actual",names(json))]
date_stringency = sapply(strsplit(names(deu_stringency),split = ".",fixed = TRUE),"[[",2)
df_stringency = data.frame(date = date_stringency,stringency = deu_stringency)
df_stringency = df_stringency %>% mutate(stringency = as.numeric(stringency), date = as.Date(date))
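The strsplit()/sapply() idiom above pulls one dot-separated field out of each element name. A toy illustration (the two-part key shape below is an assumption; the real JSON keys are not shown in this document):

```r
# Hypothetical keys with the date as the second dot-separated field
toy <- c("stringency_actual.2020-06-24" = 75.0,
         "stringency_actual.2020-06-25" = 73.1)
# strsplit returns a list of character vectors; "[[" with index 2
# picks the second field of each
dates <- sapply(strsplit(names(toy), split = ".", fixed = TRUE), "[[", 2)
dates  # "2020-06-24" "2020-06-25"
```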



# Ingolstadt
type_of_weather = unique(weatherstack_kelheim$description)
map_vector <- c("Clear","Sunny","Cloudy","Light","Light","Light","Light","Light","Light","Light","Light","Medium","Cloudy","Light","Light","Heavy","Heavy","Heavy","Light","Medium","Heavy","Heavy","Light","Heavy","Heavy","Heavy","Heavy","Heavy","Heavy","Light","Medium","Medium","Light","Heavy","Light","Light","Light","Light","Light","Heavy","Light","Medium","Heavy","Heavy","Heavy")
names(map_vector)<- type_of_weather




ingolstadt_weather = ingolstadt_weather %>% 
  mutate(season = ifelse(month(date) %in% c(12,1,2),"winter",NA)) %>%
  mutate(season = ifelse(month(date) %in% c(3,4,5),"spring",season)) %>%
  mutate(season = ifelse(month(date) %in% c(6,7,8),"summer",season)) %>%
  mutate(season = ifelse(month(date) %in% c(9,10,11),"autumn",season))# %>% dplyr::select(-tsun)
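The chain of ifelse() calls works, but the same season mapping can be written more compactly with dplyr::case_when(), which evaluates its conditions in order. A self-contained sketch on toy dates:

```r
library(dplyr)
library(lubridate)

toy <- tibble(date = as.Date(c("2021-01-15", "2021-04-10",
                               "2021-07-01", "2021-10-20")))
toy <- toy %>%
  mutate(season = case_when(
    month(date) %in% c(12, 1, 2)  ~ "winter",
    month(date) %in% c(3, 4, 5)   ~ "spring",
    month(date) %in% c(6, 7, 8)   ~ "summer",
    month(date) %in% c(9, 10, 11) ~ "autumn"
  ))
toy$season  # "winter" "spring" "summer" "autumn"
```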




day_description_impact = weatherstack_kelheim_daily %>% pivot_wider(names_from = description,values_from = n)

#remove NAs
day_description_impact[is.na(day_description_impact)] = 0

day_description_impact = day_description_impact %>% pivot_longer(cols = all_of(type_of_weather),names_to = "description",values_to = "value")

day_description_impact$description = map_vector[day_description_impact$description]

day_description_impact= day_description_impact %>% group_by(date)%>%
  top_n(n = 1,value) %>% group_by(date) %>% top_n(n = 1,description) %>% rename(weather_impact = value)
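The double top_n() call keeps, for each day, the description with the most observed hours (breaking ties by description). The same idea can be sketched with the newer slice_max() on toy data:

```r
library(dplyr)

toy <- tibble(
  date        = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  description = c("Light", "Heavy", "Cloudy"),
  value       = c(10, 14, 24)
)
# Per day, keep the single row with the largest hour count
dominant <- toy %>%
  group_by(date) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  rename(weather_impact = value)
dominant$description  # "Heavy" "Cloudy"
```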

#####Join the data#####

result_data = demand %>%
  left_join(day_description_impact, by = "date") %>%
  inner_join(ingolstadt_weather, by = "date") %>%
  inner_join(df_stringency, by = "date") %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d"))
# Add the weekday
result_data = result_data %>% mutate(wday = as.character(wday(date,week_start = 1)))
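With week_start = 1, lubridate numbers weekdays Monday = 1 through Sunday = 7, which is what the later weekend filter (wday != 6 & wday != 7) relies on:

```r
library(lubridate)

# Monday = 1 ... Sunday = 7 when week_start = 1
wday(as.Date("2021-06-07"), week_start = 1)  # 1 (a Monday)
wday(as.Date("2021-06-13"), week_start = 1)  # 7 (a Sunday)
```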

#Append holidays
result_data = result_data %>% left_join(df_holidays, by = "date") %>% replace_na(list(isHoliday = FALSE,snow = 0)) #%>% filter(noRides != 0) #%>% filter(date <"2021-07-01")

head(result_data)
## # A tibble: 6 × 21
##   date       noRides noRequests avgEucl…¹ avgTr…² descr…³ weath…⁴  tavg  tmin  tmax  prcp  snow  wdir  wspd  wpgt  pres
##   <date>       <dbl>      <dbl>     <dbl>   <dbl> <chr>     <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-06-24       0          4         0       0 Cloudy       18  18.9  10.8  24.3   0       0    73  10.4  42.5 1022.
## 2 2020-06-25       0          8         0       0 Cloudy       18  18    12.6  23.6   0       0    79   7.2  31.3 1019.
## 3 2020-06-26       0          7         0       0 Cloudy       15  19.3  10.2  26.7   0       0    82   6.8  18.5 1015.
## 4 2020-06-27       0          2         0       0 Cloudy       11  21.5  16.1  29.3   0.7     0   235   7.6  57.2 1014.
## 5 2020-06-28       0          6         0       0 Medium        9  20.7  14.9  25.4  12.1     0   307   8.3  44.3 1015.
## 6 2020-06-29       0         31         0       0 Medium        6  17.2  14.3  22.1   2.1     0   283   7.2  27.7 1015.
## # … with 5 more variables: tsun <lgl>, season <chr>, stringency <dbl>, wday <chr>, isHoliday <lgl>, and abbreviated
## #   variable names ¹​avgEuclidianDistance_m, ²​avgTravelTime_s, ³​description, ⁴​weather_impact

The following plots show a strong relationship between the day of the week and the number of rides. Because we want to simulate a typical day using MATSim, weekends were filtered out of the resulting dataset for the model. Holidays are excluded as well, since they strongly impact mobility for reasons unrelated to weather. Finally, the first 4 days are excluded because they have 0 rides, as the KeXi service had only just started.

wday_plot = ggplotly(ggplot(result_data)+
  geom_boxplot(aes(x = wday,y = noRides)))

holiday_plot = ggplotly(ggplot(result_data)+
  geom_boxplot(aes(x = isHoliday,y = noRides )))

annotations = list( 
  list( 
     x = 0.2,  
    y = 1.0,  
    text = "Weekday",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),  
  list( 
     x = 0.75,  
    y = 1.0,  
    text = "Is Holiday",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ))

subplot(wday_plot,holiday_plot) %>% layout(annotations = annotations)

result_data = result_data %>% filter(wday!=6 & wday!=7,isHoliday == FALSE, noRides!=0)

After this first round of data processing, it is helpful to look for dependencies in the data using scatter plots against the number of rides. Here is a summary of the resulting dataset:

result_data$description = factor(result_data$description)
result_data$season = factor(result_data$season)
result_sum = data.frame(
  c("noRides","description","weather_impact","tavg","tmin","tmax","prcp","snow","wspd","wpgt","pres"),
  c("Number of rides in day (dependent variable)",
    "Weather description - the type of the weather with highest absolute duration among descriptions during a day",
    "Number of hours of selected description with maximal hours a day",
    "The average air temperature in °C",
    "The minimum air temperature in °C",
    "The maximum air temperature in °C",
    "The daily precipitation total in mm",
    "The maximum snow depth in mm",
    "The average wind speed in km/h",
    "The peak wind gust in km/h",
    "The average sea-level air pressure in hPa"),
  c("Mean: 80.2",
    "Clear, Cloudy, Heavy, Light, Medium, Sunny",
    "Mean: 12 h",
    "Mean: 10.37 °C",
    "Mean: 5.81 °C",
    "Mean: 15.06 °C",
    "Mean: 1.76 mm",
    "Mean: 0.2348 mm",
    "Mean: 8.6 km/h",
    "Mean: 32.75 km/h",
    "Mean: 1019.3 hPa"))
colnames(result_sum) = c("Variable","Description","Stat")
knitr::kable(result_sum)
Variable Description Stat
noRides Number of rides in day (dependent variable) Mean: 80.2
description Weather description - the type of the weather with highest absolute duration among descriptions during a day Clear, Cloudy, Heavy, Light, Medium, Sunny
weather_impact Number of hours of selected description with maximal hours a day Mean: 12 h
tavg The average air temperature in °C Mean: 10.37 °C
tmin The minimum air temperature in °C Mean: 5.81 °C
tmax The maximum air temperature in °C Mean: 15.06 °C
prcp The daily precipitation total in mm Mean: 1.76 mm
snow The maximum snow depth in mm Mean: 0.2348 mm
wspd The average wind speed in km/h Mean: 8.6 km/h
wpgt The peak wind gust in km/h Mean: 32.75 km/h
pres The average sea-level air pressure in hPa Mean: 1019.3 hPa

Finding patterns using a graphical approach

tavg_plot = ggplotly(
  ggplot(result_data)+
    geom_point(aes(y= noRides,x = tavg,colour = season))
  )%>%layout(xaxis = list(title = 'tavg'), yaxis = list(title = 'noRides'))


pres_plot= ggplotly(
  ggplot(result_data)+
    geom_point(aes(y= noRides,x = pres,colour = season))
  
  )%>%
  layout(xaxis = list(title = 'pres'), yaxis = list(title = 'noRides'))
prcp_plot = ggplotly(
  ggplot(result_data %>% filter(prcp!=0))+
    geom_point(aes(y= noRides,x = prcp,colour = season))
  )%>%
  layout(xaxis = list(title = 'prcp'), yaxis = list(title = 'noRides'))
snow_plot = ggplotly(
  ggplot(result_data %>% filter(snow!=0))+
    geom_point(aes(y= noRides,x = snow,colour = season))
  )%>% layout(xaxis = list(title = 'snow'), yaxis = list(title = 'noRides'))

anno = list( 
  list( 
     x = 0.2,  
    y = 1.0,  
    text = "noRides from avg temp",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),  
  list( 
     x = 0.75,  
    y = 1.0,  
    text = "noRides from avg pressure",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),
  list( 
     x = 0.2,  
    y = 0.4,  
    text = "noRides from avg precipitation",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),
  list( 
     x = 0.75,  
    y = 0.4,  
    text = "noRides from avg snow depth",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ))

subplot(tavg_plot,pres_plot,prcp_plot,snow_plot,nrows = 2,titleY = TRUE,titleX = TRUE,margin = 0.1) %>% layout(annotations =anno)

Looking at these plots, we can barely see any strong dependencies or clusters, but one fact is conspicuous: only during summer does the number of rides fall below ~50, while in the other seasons there are always more than 50 rides.

Finding correlations

The correlation coefficient between two variables is calculated with the cor() function; later, a one-way ANOVA test will be used to check the dependence on categorical variables.
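As a quick reminder, cor() returns the Pearson coefficient in [-1, 1]; on toy vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)  # perfectly linear in x
z <- c(5, 4, 3, 2, 1)   # perfectly anti-linear in x

cor(x, y)  # 1
cor(x, z)  # -1
```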

best_pred <- result_data %>% ungroup() %>%
  dplyr::select(-noRides,-description ,-date,-season,-noRequests,-avgEuclidianDistance_m,-avgTravelTime_s,-wday) %>%
  map_dbl(cor,y = result_data$noRides) %>%
  #map_dbl(abs) %>%
  sort(decreasing = TRUE) 
print(best_pred)
##        wspd        wpgt        pres        snow        prcp        tmax        tavg        wdir        tmin 
##  0.13856396  0.10512816  0.05627937  0.03464468  0.02497444 -0.05060107 -0.05711341 -0.06097875 -0.06180955 
##  stringency 
## -0.56446211

As a result, the highest absolute correlation by far belongs to stringency (strictness of the COVID-19 policy); among the weather variables, wind speed and wind gusts score highest, while temperature, pressure, snow and precipitation correlate only weakly.

Testing of the null hypothesis is done for the dependent variable noRides against the independent categorical variables season, description and wday:

season_test = oneway.test(result_data$noRides~result_data$season)$p.value
description_test = oneway.test(result_data$noRides~result_data$description)$p.value
wday_test = oneway.test(result_data$noRides~result_data$wday)$p.value
print(c("season" =season_test,"description" =description_test,"wday" =wday_test))
##       season  description         wday 
## 9.058829e-04 5.078272e-09 9.158758e-01

From the p-values we can see that the null hypothesis can be rejected for season and description (p-value below 0.05), but not for wday (p ≈ 0.92). One more thing that should be tested is the dependence between season and temperature: the corresponding p-value is ≈1.62×10⁻¹¹². Temperature is clearly dependent on season, so one of these two variables should be dropped from our future model.
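The season-temperature dependence mentioned above uses the same one-way test; a sketch with simulated temperatures (toy data, not result_data):

```r
set.seed(1)
toy_season <- rep(c("winter", "summer"), each = 50)
# Clearly separated group means => the test should reject equality
toy_tavg <- c(rnorm(50, mean = 1, sd = 3), rnorm(50, mean = 19, sd = 3))
oneway.test(toy_tavg ~ toy_season)$p.value  # far below 0.05
```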

Building a model

Having determined the significant variables, we can build different models using different sets of variables. One thing stays the same: the first approach is a linear model, both to make predictions and to check which predictors impact the number of rides most. We first build a larger model using all relevant variables from the previous chapter, then step by step reduce the number of variables to a smaller subset and check how the model quality changes.

data = result_data

omega_model = lm(noRides ~ tavg+pres+stringency+snow,data = data)

summary(omega_model)
## 
## Call:
## lm(formula = noRides ~ tavg + pres + stringency + snow, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.156 -17.826   0.681  16.302  72.607 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 161.25557  139.79769   1.153    0.249    
## tavg         -2.07429    0.17124 -12.114   <2e-16 ***
## pres          0.02798    0.13662   0.205    0.838    
## stringency   -1.32350    0.06102 -21.690   <2e-16 ***
## snow          0.32835    0.32657   1.005    0.315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.06 on 508 degrees of freedom
## Multiple R-squared:  0.4839, Adjusted R-squared:  0.4798 
## F-statistic: 119.1 on 4 and 508 DF,  p-value: < 2.2e-16

The relatively low adjusted R² suggests that the model does not fit the data well. Let’s take a look at the predicted values and residuals.

colors = c("actual" = "blue","predicted" = "red","residuals" = "gray50","zerorides" = "purple")
model = omega_model
test_data = data %>% add_predictions(model = model) %>% add_residuals(model = model) %>% mutate(error = ifelse(abs(resid)>=20,"extreme","normal"))


ggplotly(ggplot(test_data %>% filter(year(date)>=2020)) +
  geom_point(aes(x = date,y = noRides,color = "actual"))+
  geom_point(aes(x = date,y = pred,color = "predicted"))+
  scale_color_manual(values = colors)+
  ggtitle("Model fit without date"))
ggplotly(ggplot(test_data %>% filter(year(date)>=2020))+
            geom_line(aes(x = date,y = resid,color = "residuals"))+
            geom_ref_line(h = 0)+
            scale_color_manual(values = colors)+
            ggtitle("Residuals"))

As we can see from the residuals plot, the data has a continuously growing trend that the model does not capture, so adding the date variable to the model can drastically improve the model “quality”.
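Adding date works because R stores Date as days since 1970-01-01, so lm() treats it as a numeric trend and its coefficient reads as “additional rides per day”. A toy sketch recovering a known slope (simulated data):

```r
set.seed(42)
# 100 consecutive days with a built-in trend of +0.5 rides per day
dates <- as.Date("2021-01-01") + 0:99
rides <- 10 + 0.5 * (0:99) + rnorm(100, sd = 1)

coef(lm(rides ~ dates))[["dates"]]  # close to 0.5
```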

omega_date_model = lm(noRides ~ tavg+pres+stringency+snow+date+season,data = data)

summary(omega_date_model)
## 
## Call:
## lm(formula = noRides ~ tavg + pres + stringency + snow + date + 
##     season, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.497  -9.960   0.405   9.885  40.942 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.319e+03  1.275e+02 -18.184  < 2e-16 ***
## tavg         -5.928e-01  1.737e-01  -3.413 0.000694 ***
## pres         -6.109e-02  8.780e-02  -0.696 0.486842    
## stringency    7.690e-02  6.778e-02   1.135 0.257084    
## snow          3.694e-02  2.074e-01   0.178 0.858719    
## date          1.320e-01  5.029e-03  26.238  < 2e-16 ***
## seasonspring -7.227e+00  2.239e+00  -3.228 0.001327 ** 
## seasonsummer -7.505e+00  2.400e+00  -3.127 0.001865 ** 
## seasonwinter -6.216e+00  2.563e+00  -2.426 0.015622 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.85 on 504 degrees of freedom
## Multiple R-squared:  0.7951, Adjusted R-squared:  0.7918 
## F-statistic: 244.4 on 8 and 504 DF,  p-value: < 2.2e-16
colors = c("actual" = "blue","predicted" = "red","residuals" = "gray50","zerorides" = "purple")
model = omega_date_model
test_data = data %>% add_predictions(model = model) %>% add_residuals(model = model) %>% mutate(error = ifelse(abs(resid)>=20,"extreme","normal"))

ggplotly(ggplot(test_data %>% filter(year(date)>=2020)) +
  geom_point(aes(x = date,y = noRides,color = "actual"))+
  geom_point(aes(x = date,y = pred,color = "predicted"))+
  scale_color_manual(values = colors)+
  ggtitle("Model fit with date"))
ggplotly(ggplot(test_data %>% filter(year(date)>=2020))+
            geom_line(aes(x = date,y = resid,color = "residuals"))+
            geom_ref_line(h = 0)+
            scale_color_manual(values = colors)+
            ggtitle("Residuals"))

As expected, R² increased and the residual standard error decreased to 15.85. An R² of ~0.8 is relatively high in weather statistics. To check the correctness of the approach, the residuals have to be normally distributed.

Residuals histogram

barplot <- ggplot(test_data, aes(x = resid ))+
  geom_histogram(aes(y = stat(density)),colour="black", fill="white", binwidth=7)+
  ggtitle("Omega model residuals")

ggplotly(barplot)
m <- mean(test_data$resid)
s <- sd(test_data$resid)
n <- nrow(test_data)
p <- (1 : n) / n - 0.5 / n
plot1 = ggplotly(ggplot(test_data) + 
     geom_qq(aes(sample = resid))+
     geom_abline(intercept = m, slope = s,
                 color = "red", size = 1.5, alpha = 0.8))
plot2 = ggplotly(ggplot(test_data)+
  geom_point(aes(x = p, y = sort(pnorm(resid, m, s))))+
  geom_abline(
                 color = "red",size = 1.5,alpha =0.8))

anno = list( 
  list( 
     x = 0.2,  
    y = 1.0,  
    text = "Normal QQ Plot",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),  
  list( 
     x = 0.75,  
    y = 1.0,  
    text = "Normal PP Plot",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ))

subplot(plot1,plot2) %>% layout(annotations = anno)

Reducing a model

Let’s take some variables out and see how the model performs on the data and on the F statistics. The weakly correlating variables from the “Finding correlations” chapter are snow and the interaction of weather_impact and description. There is also a strong correlation between the average temperature and the season, so the new model should contain only one of these two predictors; because temperature has the larger impact on mobility and is more fine-grained (season has only 4 levels), we take season out of the new reduced model.

reduced_1_model = lm(noRides ~ tavg+pres+stringency+snow+weather_impact*description+date,data = data)

summary(reduced_1_model)
## 
## Call:
## lm(formula = noRides ~ tavg + pres + stringency + snow + weather_impact * 
##     description + date, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.180  -9.343   0.017   9.964  37.716 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -2.224e+03  1.458e+02 -15.256  < 2e-16 ***
## tavg                             -7.192e-01  1.442e-01  -4.989 9.31e-07 ***
## pres                             -1.887e-01  1.073e-01  -1.759 0.079374 .  
## stringency                       -2.854e-01  7.940e-02  -3.595 0.000368 ***
## snow                              8.378e-02  2.606e-01   0.321 0.748064    
## weather_impact                   -2.558e-02  3.172e+00  -0.008 0.993571    
## descriptionCloudy                -9.663e-01  3.338e+01  -0.029 0.976922    
## descriptionHeavy                 -6.517e+00  3.571e+01  -0.182 0.855308    
## descriptionLight                 -3.893e+00  3.338e+01  -0.117 0.907219    
## descriptionMedium                -6.655e+00  3.868e+01  -0.172 0.863504    
## descriptionSunny                 -1.504e+01  3.725e+01  -0.404 0.686565    
## date                              1.353e-01  5.239e-03  25.822  < 2e-16 ***
## weather_impact:descriptionCloudy -3.218e-01  3.183e+00  -0.101 0.919538    
## weather_impact:descriptionHeavy   3.600e-01  3.448e+00   0.104 0.916909    
## weather_impact:descriptionLight  -5.373e-01  3.195e+00  -0.168 0.866521    
## weather_impact:descriptionMedium -3.276e-01  3.959e+00  -0.083 0.934093    
## weather_impact:descriptionSunny   6.436e-01  3.517e+00   0.183 0.854924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.08 on 375 degrees of freedom
##   (121 observations deleted due to missingness)
## Multiple R-squared:  0.7713, Adjusted R-squared:  0.7616 
## F-statistic: 79.06 on 16 and 375 DF,  p-value: < 2.2e-16
colors = c("actual" = "blue","predicted" = "red","residuals" = "gray50","zerorides" = "purple")
model = reduced_1_model
test_data = data %>% add_predictions(model = model) %>% add_residuals(model = model) %>% mutate(error = ifelse(abs(resid)>=20,"extreme","normal"))

As we can see, excluding variables from the model doesn’t make it worse: the residual standard error stays near 16.08. We can also use the anova() test function from the stats package to check the significance of the model variables. But first we would like to get the best model with the minimum possible number of variables. For this purpose we can inspect the p-values from the model summary, which say how significant each individual variable is for the model predictions: the smaller the p-value, the more significant. So we exclude snow and the combined term of weather_impact and description from the model.

As we can see, removing snow and the weather_impact/description interaction increased model quality by reducing the residual standard error. An ANOVA test between these two models also shows no significant difference, because the p-value of ≈0.98 is far above 0.05.
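The model comparison described here is an F-test via stats::anova() on nested fits; since the intermediate models are not shown, here is a sketch on the built-in mtcars data (a hypothetical stand-in for the report’s models):

```r
# Full model vs. a nested model with one term dropped
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

cmp <- anova(reduced, full)
cmp$`Pr(>F)`[2]  # second row holds the F-test p-value for the dropped term
```

A large p-value means the dropped term does not significantly improve the fit, so the smaller model can be kept.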

After a few iterations we arrive at a model that contains only the season, the temperature, the strictness of the policy restrictions and the date. This model explains most of the observations with a relatively low residual standard error, and it is minimal with respect to further exclusions, because the ANOVA significance test between this model and any model without one of the remaining predictors has a p-value < 0.05. So we arrive at our final model, which explains the number of KeXi rides using season, temperature, COVID strictness level and date (a positive growing trend).

reduced_3_model = lm(noRides ~ season+tavg+stringency+date,data = data)

summary(reduced_3_model)
## 
## Call:
## lm(formula = noRides ~ season + tavg + stringency + date, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.387  -9.874   0.308  10.182  40.898 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.377e+03  9.710e+01 -24.483  < 2e-16 ***
## seasonspring -7.163e+00  2.233e+00  -3.208 0.001419 ** 
## seasonsummer -7.503e+00  2.393e+00  -3.135 0.001817 ** 
## seasonwinter -5.912e+00  2.523e+00  -2.343 0.019497 *  
## tavg         -5.716e-01  1.693e-01  -3.377 0.000788 ***
## stringency    7.516e-02  6.758e-02   1.112 0.266588    
## date          1.317e-01  5.001e-03  26.337  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.83 on 506 degrees of freedom
## Multiple R-squared:  0.7949, Adjusted R-squared:  0.7924 
## F-statistic: 326.8 on 6 and 506 DF,  p-value: < 2.2e-16
colors = c("actual" = "blue","predicted" = "red","residuals" = "gray50","zerorides" = "purple")
model = reduced_3_model
test_data = data %>% add_predictions(model = model) %>% add_residuals(model = model) %>% mutate(error = ifelse(abs(resid)>=20,"extreme","normal"))


ggplotly(ggplot(test_data %>% filter(year(date)>=2020)) +
  geom_point(aes(x = date,y = noRides,color = "actual"))+
  geom_point(aes(x = date,y = pred,color = "predicted"))+
  scale_color_manual(values = colors)+
  ggtitle("Final model fit"))
ggplotly(ggplot(test_data %>% filter(year(date)>=2020))+
            geom_line(aes(x = date,y = resid,color = "residuals"))+
            geom_ref_line(h = 0)+
            scale_color_manual(values = colors)+
            ggtitle("Residuals"))

Having arrived at a minimal model, we also want to check once more how the residuals are distributed, along with some additional statistical metrics.

barplot <- ggplot(test_data, aes(x = resid ))+
  geom_histogram(aes(y = stat(density)),colour="black", fill="white", binwidth=9)+
  ggtitle("Final model residuals")

ggplotly(barplot)

We see a normally distributed variable, skewed to the left, with 3 outliers on the left. (Analysis of the individual days shows that the extremely low numbers of rides occurred when the operator collecting the data changed, so this is a data-specific problem that can be ignored or excluded from the model.)

test_data = test_data %>% filter(resid>=-50)
m <- mean(test_data$resid)
s <- sd(test_data$resid)
n <- nrow(test_data)
p <- (1 : n) / n - 0.5 / n
plot1 = ggplotly(ggplot(test_data) + 
     geom_qq(aes(sample = resid))+
     geom_abline(intercept = m, slope = s,
                 color = "red", size = 1.5, alpha = 0.8))
plot2 = ggplotly(ggplot(test_data)+
  geom_point(aes(x = p, y = sort(pnorm(resid, m, s))))+
  geom_abline(
                 color = "red",size = 1.5,alpha =0.8))

anno = list( 
  list( 
     x = 0.2,  
    y = 1.0,  
    text = "Normal QQ Plot",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),  
  list( 
     x = 0.75,  
    y = 1.0,  
    text = "Normal PP Plot",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ))

subplot(plot1,plot2) %>% layout(annotations = anno)

So our residuals are normally distributed :)
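Besides the QQ/PP plots, normality can also be checked formally, e.g. with a Shapiro-Wilk test; a sketch on simulated residuals (toy data, not the model’s residuals):

```r
set.seed(7)
toy_resid <- rnorm(200, mean = 0, sd = 15)
# A large p-value means no evidence against normality
shapiro.test(toy_resid)$p.value
```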

Excluding more variables from the model doesn’t improve it and enormously increases the residual error, so we can conclude that the presented model is the minimal possible fit using the variables through which weather has the most impact on the number of rides.

Conclusion

The performed LINEAR regression analysis shows that there are no strong correlations between the number of rides and the weather characteristics, except for temperature, and temperature itself is strongly dependent on season. One difference between using season instead of temperature is that relative temperature changes within the same season are not captured, so it may be that a relative drop in temperature during summer also strongly impacts mobility. Also, the data was collected during the COVID-19 pandemic, so the stringency (policy strictness) is significant when building a model. Finally, the date is used in the model because there is a positive growing trend in the use of the KeXi ride service.